INTERLISP PERFORMANCE MEASUREMENTS


Introduction 


The advantages of LISP for fast production of large software
systems, especially those involving Artificial Intelligence
applications, are too well known for us to expound on here. Suffice it
to say that systems such as DENDRAL, MACSYMA, SOPHIE, SCHOLAR, LUNAR,
etc., could not have evolved and could not have been developed within
the time and level of effort they actually required to complete, had it
not been for the existence of a sophisticated LISP programming
environment, of which perhaps INTERLISP is the best known.


But as all users of these systems know,
sophisticated programming environments are
counterparts. Although programming (debuggi
running finished products written in INTERLISP
fast clip when the total load on the machine
performance increases rapidly - and seemi
intolerable levels as soon as large numbers
demands imposed on the computer's resources.


An indication of how bad things can be is
typical example. In a busy morning, a simple
that uses under 300 milliseconds of CPU time ta
magnitude more elapsed time than would be expec
on the machine. Thus, extremely slow responsi
small computational demands is one of the serious problems that we shall
investigate.

Another aspect of this performance degradation manifests itself in
the behavior of CPU bound jobs, such as compiling. In spite of the
TENEX pie-slice scheduler's guaranteed fraction of CPU power, compute
bound INTERLISP jobs rarely get more than Sh* of their guaranteed CPU
power when, again, the load imposed on the machine by other users
increases beyond certain limits.

The above cited typical situations provide the motivation and the
framework for the work to be described. Our objective was to pin down
what aspects of the "INTERLISP cum TENEX" environment were responsible
for the observed objectionable behavior, and to propose and implement
remedies to improve the situation. More specifically, our goals were to
improve system responsiveness for short interactions (e.g. editing) and
to increase system efficiency and throughput when executing CPU bound
jobs.


To this end we performed an extensive series of measurements
covering a variety of aspects of system behavior. We obtained
statistics of usage of INTERLISP from different users doing different
things; we traced the way INTERLISP uses core, both in the space and the
time dimensions; and we identified with high resolution the areas where
most instruction executions take place, i.e. where and doing what the
system spends most of its time.


Whoever has tried to understand with precision the behavior of a
complex system immersed in a time-sharing system knows how difficult it
is to actually measure what one wants, and how hard it is to interpret
the data one finally obtains. For this reason, we shall endeavor to
describe faithfully the methods and procedures used to obtain our data,
the conditions under which it was obtained, and our reasons for
asserting that it means what we believe it does. Our aim is not only to
describe our work and justify our results, but also to make the data and
the methodology used to obtain it available to others who may find it
useful for their own purposes.

Before embarking on this voyage, however, let us advance here our
main conclusions for the benefit of readers not wishing to wade through
the rest of the paper.

1) The lack of responsiveness (disproportionately long elapsed
times for relatively modest computational demands, or waiting 20
seconds when 3 seconds should have been enough) is due to both the
large working set needed by INTERLISP and the particular way the
TENEX operating system allows the core memory allocated to a
process to grow to the process' working set size. In order for any
significant amount of useful computation to take place in
INTERLISP, it is necessary to have from 60 to 100 pages in core;
with fewer than that, program execution is interrupted by page
faults at intervals of a millisecond or less. Since TENEX does not
do any preloading, and forces a process to grow its working set by
page faulting itself up from page 1, it takes roughly 10 seconds
(at 100 milliseconds of wait time per fault for latency, page
management routines, and rescheduling delays) to build up an
INTERLISP working set. At this point (or even sooner), however,
since page fault interrupts are considered part of the process'
chargeable CPU time, the process will have exceeded its quantum
allocation on the high priority interactive queue, and will descend
to a lower priority scheduling queue. If by the time the second
startup of the process occurs (on the lower priority queue), the
process is still in the balance set (i.e. has most of its pages in
core) some meaningful computation can then begin to take place. If
not, the same painful page by page reloading process takes place
again.

The  remedies  to  this  situation  are  direct: 

a)  reduce  the  size  of  INTERLISP's  working  set 

b) modify TENEX so that demand paging occurs after initial
preloading of the previous working set for the job.

2) Our second main conclusion addresses the issue of basic
efficiency. It pertains more definitely to the INTERLISP system
itself, and less to TENEX, and has its major impact on
compute-bound processes. Briefly, we found that roughly 80% of the
instruction fetches occur within 30 pages of shared virtual address
space, chiefly the MACRO (or hand-coded) module. In other words,
the system spends a majority of its time executing instructions
within a relatively small and functionally well-defined area of the
INTERLISP address space. It follows that tightening and
streamlining code in those sections should bring about the largest
payoffs in terms of system operating efficiency. Our measurements
in this regard were of very high resolution. We were able to
pinpoint the most often used R word blocks of code in the MACRO area,
pointing out in great detail where improvements were needed. Our
work on shallow binding, fast function entry, fast type checking
and fast CONSing responds to these documented bottlenecks.
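
The 10-second figure in conclusion 1 follows from simple arithmetic, sketched here in Python. The function name and its default argument are illustrative only; the 100 milliseconds of wait per fault and the 60 to 100 page working set are the figures quoted above.

```python
def working_set_build_time_s(pages, wait_ms_per_fault=100.0):
    """Elapsed seconds to fault in `pages` pages one at a time, with no preloading."""
    return pages * wait_ms_per_fault / 1000.0


if __name__ == "__main__":
    for pages in (60, 100):
        print(pages, "pages:", working_set_build_time_s(pages), "seconds")
```

Under these assumptions a 100-page working set costs about 10 seconds of waiting before useful computation starts, which is the delay that preloading the previous working set is meant to remove.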

MEASUREMENTS

In order to characterize the computational patterns of INTERLISP
running under TENEX, two distinct types of measurements were made. One
set of measurements involved the pattern of usage of TENEX resources by
several INTERLISP users over an extended period of time. The second set
of measurements involved detailed examination of the underlying activity
of the INTERLISP system itself as it was performing a number of typical
tasks.

Usage patterns and modes of interaction

A very suitable strategy for the amelioration of INTERLISP
performance  is  to  concentrate  on  those  patterns  of  usage  that  involve  a 
large  and  perhaps  unnecessary  amount  of  computational  power,  memory, 
and/or  the  user's  own  time.  In  order  to  do  this,  we  needed  to 
characterize  the  actual  use  of  INTERLISP  by  normal  users  going  about 
their  daily  business. 

The  first  set  of  measurements  used  the  built-in  TENEX  job  parameter 
statistics (CPU time charged, elapsed time, page faults, time charged
within page-fault routines) to monitor the activity patterns of several
typical  users  over  an  extended  period  of  time. 

The data given is for a moderately long run (about 200 events) of
editing, compiling and associated debugging operations, typical of much
of the activity of the LISP community. An event is defined as a single
operation the user requires INTERLISP to do (like a PP or DW command in
the editor, or a compilation) which results in a certain amount of CPU
time being consumed. The total CPU time required for the run is about 2
minutes.

We plot three quantities. First we split the events up into groups
defined by a range of required CPU time for the event (thus one group
might be events which took between 200 and 250 milliseconds of CPU time
to complete). For each group we then plot the proportion of the total
CPU time used by the job which is attributable to events in that group.
The graph indicates the chosen CPU time ranges for events. The upper
value of the range is given in the left hand column, and the lower value
is the previous upper value (all values given in milliseconds). The
number of events requiring CPU times within the range is given in the
second column, and the percentage of the total CPU usage attributable to
those events is given in the third column. The percentage is also
plotted as a bar graph immediately to the right, with scale given below.
Note that, if we define interactive events as those that require CPU
times of less than 300 milliseconds, only 21% of the total CPU is used
in interactive operations. This is actually somewhat higher than we
have seen in the uncontrolled data taken from several typical users -
this run involves a lot of editing. Thus, the lion's share of the CPU
load placed on the system is in long, CPU-bound activities.

The second graph shows the number of events requiring CPU times
within each of the chosen ranges. It shows quite graphically that 2/3
(67%) of the total number of events represents "interactive" work.
Thus, any degradation of performance resulting in the increase of
elapsed time for events (particularly interactive events) will be very
strongly felt by the user - with tremendous frustration, judging by
typical reactions.
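
The grouping behind the first two graphs can be sketched as follows. This is a minimal illustration, not the measurement program actually used: the function name is ours, the event list in the usage note is made up, and each range is taken to include its upper bound.

```python
from bisect import bisect_left


def bucket_events(cpu_ms, upper_bounds):
    """Return (count, percent of events, percent of total CPU) for each range.

    Range i covers CPU times above upper_bounds[i-1] and up to upper_bounds[i];
    events above the last bound fall into a final open-ended range.
    Assumes a non-empty list of events with a positive total.
    """
    counts = [0] * (len(upper_bounds) + 1)
    cpu = [0.0] * (len(upper_bounds) + 1)
    for t in cpu_ms:
        i = bisect_left(upper_bounds, t)  # index of the first bound >= t
        counts[i] += 1
        cpu[i] += t
    n, total = len(cpu_ms), sum(cpu_ms)
    return [(c, 100.0 * c / n, 100.0 * s / total) for c, s in zip(counts, cpu)]


if __name__ == "__main__":
    # Hypothetical events: three short interactions and one long compilation.
    for row in bucket_events([100, 200, 300, 5400], [300]):
        print(row)
```

With the hypothetical events shown, three quarters of the events are interactive yet they account for only a tenth of the CPU time, the same qualitative split the two graphs report.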

Both of these plots are given in terms of net CPU time - this means
that for each interaction we have subtracted the time that TENEX
indicates was spent in the page-faulting routines, since that time
varies strongly with load. This gives an indication of the "basic time"
spent in the various interactions. An indication of the additional CPU
time billed because of page-faulting is given in the third graph which
gives percentage of interactions by gross CPU time - this includes all
TENEX billed time, including page faulting. For this example run at
relatively low load, there were 2642 faults, representing approximately
11,300 milliseconds of time recorded as spent in the page faulting
routines. The overall gross CPU time is 74,800 milliseconds. Thus,
about 15% of the billed time is due to page faulting.

The same three graphs are given for an almost identical run at
moderate load. While the first two graphs are roughly similar (with the
exception of a "spike" at the .5-1.0 second range due to a number of
error recovery (DWIM type) operations caused by excessive mistyping),
the third graph shows that there is a notable increase in CPU time
billed because of the increase in page-faulting. The total gross CPU
time is 132,250 milliseconds, with 16053 page faults, accounting for
approximately 34,000 milliseconds or over 26% of the billed CPU time.
It is interesting to note that there is a difference of about 34 seconds
of net CPU time between these two runs as obtained by subtracting the
TENEX reported page fault time from the gross CPU time. Since the
reported billed time per page fault was over 4 milliseconds during low
load, and about 2 milliseconds during high load, it is possible that
more time was spent in the page fault routines during the high load
situation than was recorded by TENEX. Of course, the slight change in
the run accounts for some part of the difference, but probably not more
than half. The page faulting behavior is examined in more detail in a
later section.
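
The accounting in the last two paragraphs can be reproduced with a short sketch. The gross totals and fault counts are those quoted in the text; the net totals (63,550 and 98,250 milliseconds) are gross time less the reported page fault time. The function and dictionary names are illustrative.

```python
def fault_accounting(gross_ms, net_ms, faults):
    """Derive page-fault time, its share of billed time, and the cost per fault."""
    fault_ms = gross_ms - net_ms
    return {
        "fault_ms": fault_ms,
        "fault_share_pct": 100.0 * fault_ms / gross_ms,
        "ms_per_fault": fault_ms / faults,
    }


RUNS = {
    "light load": dict(gross_ms=74_800, net_ms=63_550, faults=2_642),
    "moderate load": dict(gross_ms=132_250, net_ms=98_250, faults=16_053),
}

if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, fault_accounting(**run))
    extra_net = RUNS["moderate load"]["net_ms"] - RUNS["light load"]["net_ms"]
    print("difference in net CPU time, ms:", extra_net)
```

The sketch makes the anomaly discussed above easy to see: the billed cost per fault drops by about half under load while the net CPU time for nearly the same work rises by over 34 seconds.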

Editing, compiling, dwimifying and clispifying - light load

AVERAGE LOAD IS: 1.265496

Total net CPU time: 63,550 milliseconds
Total gross CPU time: 74,800 milliseconds

Editing, compiling, dwimifying and clispifying - moderately heavy load

AVERAGE LOAD IS: 6.637456

Total net CPU time: 98,250 milliseconds
Total gross CPU time: 132,250 milliseconds

MAX. CPU ! PERCENTAGE OF INTERACTIONS OF GIVEN GROSS CPU TIME
MILLISEC !
      50 !   0.0 !
     100 !   1.5 ! *
     150 !   7.5 ! *******
     200 !   8.0 ! ********
     250 !  10.5 ! **********
     300 !  11.0 ! ***********
     350 !   7.5 ! *******
     400 !   8.5 ! ********
     450 !   4.0 ! ****
     500 !   5.5 ! *****
    1000 !  19.0 ! *******************
    1500 !   3.5 ! ***
    2000 !   3.0 ! ***
    2500 !   2.0 ! **
    3000 !   0.0 !
    3500 !   1.0 ! *
    4000 !   1.0 ! *
    4500 !   0.5 !
    5000 !   0.5 !
   10000 !   0.5 !
   15000 !   1.0 ! *
   20000 !   0.0 !
   40000 !   0.5 !
   60000 !   0.0 !
  120000 !   0.0 !
  180000 !   0.0 !
  240000 !   0.0 !
  300000 !   0.0 !
  360000 !   0.0 !
   *INF* !   0.0 !
                   0    5    10   15   20   25   30   35   40   45



BBN  Report  No.  3331 


Bolt  Beranek  and  Newman  Inc. 


Change in Interfault interval between low and moderate loads

As  described  above,  one  of  our  observations  is  that  the  major 
non-linear  effect  of  machine  load  occurs  as  a  result  of  vastly  increased 
page  faulting,  particularly  for  short,  supposedly  "interactive"  jobs. 
We  plot  the  average  time  (net  CPU  -  not  counting  time  TENEX  indicates  to 
be  "page  fault  time")  between  page  faults  for  different  net  CPU  length 
interactions.  This  data  is  plotted  for  the  two  runs  of  the  typical 
editing,  compiling,  etc.  job  described  above,  at  two  different  load 
averages. Note that we only have control over load average; we do not
have any direct measurement of actual memory contention. The low load
average  run  is  at  a  value  slightly  higher  than  the  "dead  of  night  load", 
but  roughly  comparable.  The  high  load  average  run  is  only  normal 
moderate  afternoon  loading  -  that  is  already  bad  enough  in  terms  of  page 
faulting,  so  that  really  horrible  load  averages  such  as  the  10-20  range 
are  not  shown  (and  heaven  forbid  the  20-30  load  range). 
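
The quantity plotted can be sketched as follows: interactions are grouped by net CPU length, and within each group the net CPU time is divided by the number of faults. The function name and the sample records are hypothetical; only the definition of the interval (net CPU time between faults) comes from the text.

```python
from bisect import bisect_left


def mean_interfault_ms(interactions, upper_bounds_ms):
    """Mean net CPU ms between page faults for each net-CPU-length range.

    interactions: iterable of (net_cpu_ms, page_faults) pairs, one per interaction.
    Returns one entry per range; None marks a range in which no faults occurred.
    """
    n = len(upper_bounds_ms) + 1
    net = [0.0] * n
    faults = [0] * n
    for net_ms, page_faults in interactions:
        i = bisect_left(upper_bounds_ms, net_ms)
        net[i] += net_ms
        faults[i] += page_faults
    return [t / f if f else None for t, f in zip(net, faults)]


if __name__ == "__main__":
    # Hypothetical records: two short interactions and one long one.
    sample = [(100, 50), (200, 100), (2000, 100)]
    print(mean_interfault_ms(sample, [300]))
```

Applied to the whole-run totals quoted earlier, the same ratio gives roughly 24 milliseconds between faults at low load (63,550 / 2642) and roughly 6 milliseconds at moderate load (98,250 / 16053).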

Low load average - about 1.2
2642 total page faults

Moderate load average - about 6.6
16053 total page faults

The measurements just described provide a picture of what "TENEX
believes"  are  the  characteristics  of  LISP  execution  in  a  time-shared 
environment.  We  use  the  expression  "TENEX  believes"  because,  as  in  any 
time-sharing  environment,  the  usage  parameters  shown  for  a  given  job 
depend  heavily  on  the  system  load  over  the  course  of  the  job.  In  part 
this  is  due  to  the  necessarily  approximate  allocation  of  system  overhead 
among  the  active  jobs,  which  appears  as  an  addition  to  the  computational 
resources  the  jobs  would  consume  if  they  were  running  alone.  More 
important,  however,  is  the  fact  that  both  the  actual  amount  of  overhead 
and  the  allocation  of  this  overhead  to  different  jobs  varies 
substantially  with  different  job  mixes.  A  job  with  given  memory 
requirements  for  example,  will  page-fault  much  more  often  when  it  is 
competing  for  core  space  with  other  memory-hungry  jobs  (or  many 
small-memory  jobs)  than  when  it  is  running  in  less  memory-competitive 
environments.  Handling  these  page  faults  results  in  additional  overhead 
(CPU  time)  charged  to  the  job. 

Excessive page-faulting causes a dramatic lengthening of the
elapsed time for a job not only because disk latency increases the
effective cycle time for memory references but because, more importantly
perhaps, such behavior can interact with the scheduler, resulting in a
job with basically interactive CPU requirements (a small fraction of a
second of CPU time needed between interactions with the user) being
dropped from the high-priority interactive queue and placed on the
less-frequently serviced compute-bound queues. I/O contention causes
similar problems in increasing overhead and wait times for jobs
competing for use of shared devices such as the disk. Thus, for no
fault that is intrinsically their own, certain jobs may be penalized
because their overhead-burdened CPU consumption makes the scheduler
decide that they belong in a lower-priority queue. In situations of
high memory contention this effect can pyramid, because during the wait
on the low priority queue the job may have most of its in-core pages
removed from core, and thus have to fault many more times than it would
have had to if it were allowed to finish its short CPU interaction.

In  short,  the  usage  parameters  vary  because  the  memory  load  and  CPU 
demand  on  the  system  change  with  different  mixes  of  jobs,  and  these  load 
factors  strongly  affect  the  interaction  of  a  user  program  (e.g. 
INTERLISP)  and  the  TENEX  memory  manager,  i/o  drivers  and  scheduler. 
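
The interaction between fault charges and the scheduler can be illustrated with a toy model. It is not a description of the actual TENEX scheduler: the quantum value in the usage example is hypothetical, the function names are ours, and the charge of about 3 milliseconds per fault is the average assumed elsewhere in this report.

```python
def charged_ms(net_ms, faults, ms_per_fault=3.0):
    """CPU time billed to the job: its own computation plus page-fault handling."""
    return net_ms + faults * ms_per_fault


def stays_interactive(net_ms, faults, quantum_ms, ms_per_fault=3.0):
    """True if the billed time still fits within the interactive-queue quantum."""
    return charged_ms(net_ms, faults, ms_per_fault) <= quantum_ms


if __name__ == "__main__":
    QUANTUM_MS = 300  # hypothetical interactive quantum, for illustration only
    for faults in (10, 100):
        print(faults, "faults:", stays_interactive(250, faults, QUANTUM_MS))
```

In this model the same 250-millisecond interaction keeps its interactive priority with 10 faults but is demoted with 100, even though the work requested by the user is identical.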

Memory and CPU usage of INTERLISP as a separate system

The  TENEX  statistics  correlated  well  with  what  the  "monitored" 
users  experienced  (and  thus  what  the  "typical  user"  would  be  likely  to 
experience)  in  operating  INTERLISP  under  TENEX.  While  these  statistics 
suggested  several  changes  to  the  TENEX  system,  they  were  insufficient  to 
provide a guide to the modifications to INTERLISP which would most
improve  the  operation  of  the  combined  INTERLISP/TENEX  system.  This  was 
due  both  to  the  coarseness  of  the  measurements  with  regard  to  the 
operation  of  INTERLISP  itself  as  an  independent  job,  as  well  as  to  the 
great  difficulty  of  characterizing  the  details  of  the  actual  interaction 
between  the  two  systems  (or  even  characterizing  the  system  load 
parameters  which  prevailed  during  the  measurements).  Thus  it  was 
necessary  to  obtain  an  entirely  independent  characterization  of  the 
memory  and  CPU  usage  of  INTERLISP  in  executing  typical  operations. 

This  independent  characterization  consisted  of  a  series  of  related 
measurements  based  on  a  PDP-10  simulator  program  running  under  TENEX. 
The  simulator  is  a  program  which  sits  in  a  user's  address  space,  and 
essentially  single-steps  through  a  user  program.  The  simulator  takes 
over  from  the  PDP-10  hardware  the  job  of  computing  the  effective 
addresses  for  each  of  the  user  program's  instructions,  and  provides 
hooks  to  allow  a  measurement  program  to  record  the  memory  reference 
pattern  of  the  user  job  in  any  degree  of  detail  desired.  It  is 
important  to  note  that  the  simulator  sees  a  JSYS  monitor  call  as  one 
instruction  -  NO  ANALYSIS  IS  MADE  OF  TIME  SPENT  IN  THE  TENEX  MONITOR 
DOING  I/O  AT  THE  USER'S  BEHEST.  Thus,  any  program  involving  i/o  will 
seem to execute fewer instructions (as counted by the simulator) than
are  actually  executed  when  the  program  itself  is  run  on  the  PDP-10. 
There  are  many  other  subtleties  involved  in  understanding  precisely  how 
the  simulator  works  and  how  the  data  was  analyzed.  However,  we  feel 
that  these  details  are  best  discussed  after  we  have  presented  the  gist 
of  the  measurement  results. 


Page  Faulting  versus  Allowed  Working  Set 

INTERLISP  has  acquired  a  reputation  as  a  "Core  hog"  -  a  program 
that  requires  huge  amounts  of  core  in  order  to  run.  One  of  the  most 
interesting  things  to  do  with  the  page  reference  data  is  to  determine 
exactly  how  much  core  INTERLISP  needs  to  run.  Of  course  this  is  a 
poorly  defined  question  -  what  is  interesting  is  the  tradeoff  between 
the  expected  number  of  page  faults  (or  the  expected  time  between  page 
faults)  and  the  number  of  pages  allowed  in  the  working  set.  It  is 
difficult  to  determine  the  tradeoff  mentioned  above  in  the  case  of 
TENEX,  because  the  page  management  algorithms  in  TENEX  are  rather 
complicated  and  are  influenced  by  the  existence  of  pages  shared  among 
several  processes  (which  may  cause  TENEX  to  lose  track  of  the  last  time 
a given process used a shared page). Thus, we have resorted to using
the  page  usage  data  in  conjunction  with  a  simplified  page  management 
model  in  order  to  give  some  indication  of  the  effect  of  working  set  size 
on  page  fault  rate. 

We have produced graphs showing the number of page faults expected
for several measured programs for allowed working set sizes ranging from
about 40 to 200 pages, using an approximation to a simple page
management algorithm. The assumed page management routine is a simple
LEAST RECENTLY USED (LRU) algorithm working with a fixed size working
set. Thus, when a process starts up it begins to fault in pages, until
it has brought in as many pages as there are allowed in the particular
fixed size of the working set. The next time that a page not in the
working set is referenced, the page in the working set least recently
referenced is removed from the working set and replaced with the new
page. The same process goes on for each page referenced which is not in
the current working set. It is possible to simulate the behavior of
such a page management algorithm for different fixed size working sets
and to determine the number of page faults that would result for a given
process for which we have page reference data.
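
A fixed-size LRU working set of the kind just described can be simulated in a few lines. This is a minimal sketch written for illustration, not the analysis program used for the report; it counts the faults of the initial fill as well as later replacements, and the page reference string in the usage example is a textbook one rather than measured data.

```python
from collections import OrderedDict


def lru_page_faults(page_refs, working_set_size):
    """Count faults for a fixed-size LRU working set that starts empty.

    page_refs: sequence of page numbers in the order they were referenced.
    working_set_size: maximum number of resident pages (assumed >= 1).
    """
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(resident) >= working_set_size:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = None
    return faults


if __name__ == "__main__":
    refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    for size in (3, 4, 5):
        print(size, "pages allowed:", lru_page_faults(refs, size), "faults")
```

Running the function over one recorded reference string for each allowed size yields a curve of faults against working set size of the kind tabulated in this section.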

We  present  below  the  graphs  of  page  faults  versus  allowed  fixed 
working  set  size  for  three  typical  program  executions,  and  include 
tabular  data  for  other  measured  programs  in  the  appendix.  A  number  of 
inferences can be drawn from them, depending on various assumptions that
might be made about paging behavior on TENEX and on the parameters of
interest.

The  first  example,  referred  to  as  DWCL,  involves  three  typical  user 
operations  invoked  under  the  LISP  editor  -  "dwimifying"  an  expression, 
"clispifying"  an  expression,  and  PRETTYPRINTing  the  expression.  (In  the 
appendix  we  present  the  data  for  a  much  longer  run,  called  EDIT/CLEANUP, 
involving many editing operations of substitution, structure changing,
etc.,  obtained  by  repeating  a  protocol  of  an  actual  large  scale 
debugging  session  using  the  simulator.)  The  second  example,  referred  to 
as  REGCOM,  involves  the  compilation  of  a  set  of  functions  which  are 
already  in  core  (i.e.  COMPILE  as  against  TCOMPL,  so  no  file  reading 
operations  are  included).  The  third  example  is  the  operation  of  the 
structure  generator  from  the  DENDRAL  program,  generating  the  possible 
structures  of  the  compound  0*1  H6  (it  is  referred  to  as  CONGENSIM  -  the 
CONGEN  simulation). 

The graphs of page-faulting behavior for these examples are given
below. The first column (labelled "WORKING SET SIZE") gives the number
of pages allowed to accumulate in core before the LRU algorithm is used
to replace old pages with new ones (causing page faults). The second
column (labelled "PAGE FAULTS") is the number of page replacements that
occur for the corresponding working set size. These two columns give a
complete tabular representation of the data. The data is graphed to the
right of the tabular representation, with the Y-axis being allowed
working set size (as given in the first column), and the X-axis being
the number of page faults per 20000 memory references (this serves to
make the graphs of different runs more comparable), with the scale for
the number of page faults being given below the graph. As is indicated,
there are approximately 1.2 million memory references which took place
in the course of the dwimification, clispification and PRETTYPRINTing.
Note that the instructions executed in monitor mode (for the
i/o in PRETTYPRINTing) are not accounted for, nor are any instructions
executed in TENEX for page management, scheduling, etc.
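
The normalization to faults per 20000 memory references, and the bar-graph layout described above, can be sketched as follows. The function name, the column widths and the fault counts in the usage example are illustrative assumptions; only the normalization itself is taken from the text.

```python
def fault_table(rows, total_refs, per=20000):
    """Format (working set size, page faults) rows with one star per normalized fault.

    The bar length is the number of faults per `per` memory references, rounded.
    """
    lines = []
    for size, faults in rows:
        rate = faults * per / total_refs
        lines.append(f"{size:>6} {faults:>8}  {'*' * round(rate)}")
    return lines


if __name__ == "__main__":
    # Hypothetical fault counts for a run of 1,200,000 memory references.
    for line in fault_table([(40, 600), (100, 120)], 1_200_000):
        print(line)
```

Dividing by the length of the run in this way is what lets runs of different lengths be compared on a single scale.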

Example: DWCL

1204224 Memory references in example

Example: REGCOM

2037760 Memory references in example

Example: CONGENSIM

2283520 Memory references in example

Interpretation of page faulting results


There are a number of subtleties that should be borne in mind in
looking at the data. In the first place, the number of page faults is
given assuming that the job starts from scratch, with no pages in core.
Once the job is running, it is able to keep its entire allowed working
set with no losses, throughout the entire run, simply bringing in new
pages and swapping out LRU pages. If one wishes to make an estimate of
the frequency of page faults "in the steady state" for a compute bound
job, one should probably assume that the job has its full working set
in, and count faults after that. Thus, for this purpose, one should
subtract the size of the allowed working set from the fault count for
the given working set, in order to determine how many faults occurred in
the steady state condition. However, if one is considering the number
of faults likely to occur if the interaction starting the example occurs
several seconds after the last user interaction, then the number of page
faults as stated is meaningful under the standard TENEX page management
operation - by the end of a few seconds of waiting for the user to
initiate an interaction, the user's program is probably no longer in
core because of competition with other jobs demanding memory in order to
run.


The  substantial  flurry  of  page  faults  needed  to  start  up  an 
interaction  when  the  program  is  not  in  core  might  account  for  the 
difference  in  responsiveness  felt  between  night-time  and  daytime  running 
of  INTERLISP  -  at  night  there  are  times  when  the  number  of  users  is 
small  enough  that  the  core  allocation  for  a  user  does  not  decay  for 
quite  a  while  -  conceivably  two  or  three  LISPs  could  reside  in  core  and 
not be swapped out while waiting for users' responses. Thus, the
response  to  a  request  (similar  to  previous  requests  in  terms  of  the 
particular  pages  needed  to  execute  the  request)  can  occur  immediately, 
with  relatively  little  page  faulting.  During  heavier  memory  contention 
times,  the  same  request  may  require  over  a  hundred  page  faults  just  to 
initialize the working set. In turn, the charge for this faulting may
take the job off the interactive queue and thus cause a delay until the
job  starts  up  on  the  lower  queue  -  during  which  time  the  pages  used  by 
the  job  can  start  to  trickle  out  due  to  contention  by  other  jobs. 

Assuming  that  the  jobs  in  question  have  a  high  enough  priority  to 
run  to  completion  without  being  removed  from  core,  one  can  ask  how  the 
billed  CPU  time  for  the  job  varies  as  a  function  of  the  allowed  working 
set  size.  Assuming  that  TENEX  charges  an  average  of  about  3 
milliseconds  of  CPU  time  per  page  fault,  a  page  faulting  rate  of  one 
fault  per  3  milliseconds  would  double  the  charged  time  for  the  job.  By 
comparing  the  number  of  memory  references  reported  by  the  simulator  to 
the  billed  CPU  time  for  a  given  job  (subtracting  off  time  TENEX 
attributes to paging) we find that each memory reference accounts for
about  1.5  microseconds  of  CPU  time  (this  includes  memory  reference  time, 
pager  time,  and  the  time  to  execute  instructions)  on  the  average.  Thus, 
the  billed  time  doubles  when  there  is  one  page  fault  every  2000  memory 
references.  In  the  editing  run  this  corresponds  to  a  working  set  size 
of  approximately  115  pages,  for  the  compilation  example  to  about  100 
pages, and for the CONGENSIM example to about 76 pages. (These figures
are based on the total number of page faults given by the simulator as
plotted above. For the "steady state", the corresponding sizes are about
100, 70, and 68.)
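The doubling threshold quoted above can be checked with a short sketch. The two constants are the averages assumed in the text (3 milliseconds billed per fault, 1.5 microseconds per memory reference); the function name is illustrative only and is not part of the measurement software.

```python
# Averages quoted in the text; treated here as assumptions, not new data.
CPU_PER_FAULT_S = 3e-3        # billed CPU time per page fault
CPU_PER_REFERENCE_S = 1.5e-6  # billed CPU time per memory reference


def billed_time_factor(references_per_fault: float) -> float:
    """Billed time with paging divided by billed time without paging."""
    compute = references_per_fault * CPU_PER_REFERENCE_S
    return (compute + CPU_PER_FAULT_S) / compute


# Number of references executed in the time charged for one fault.
break_even = CPU_PER_FAULT_S / CPU_PER_REFERENCE_S

print(round(break_even))                    # references per fault at which billed time doubles
print(round(billed_time_factor(2000), 3))   # factor at one fault per 2000 references
print(round(billed_time_factor(20000), 3))  # factor at a ten times lower fault rate
```

At one fault per 2000 references the paging charge equals the compute charge, which is the doubling point used in the text.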


Another  interesting  question  is  how  the  potential  elapsed  time  for 
a  job  varies  depending  on  the  working  set.  If  one  assumes  that  the 
minimum  time  it  takes  to  fetch  a  page  from  disc  is  about  one  disc 
latency  plus  the  TENEX  billed  CPU  time  per  fault,  one  can  say  that  the 
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minimum  elapsed  time  for  a  faulted  reference  is  about  30  milliseconds, 
corresponding  to  about  20000  regular  memory  references.  Thus,  a  page 
fault rate of one fault per 20000 references would cause a doubling of
potential  elapsed  time.  Other  estimates  of  effective  elapsed  time  per 
fault  can  be  made,  to  take  into  account  scheduling  overhead  and  waits, 
etc.  These  estimates  range  up  to  100  milliseconds  per  fault.  This 
would  correspond  to  65000  references.  There  is  also  a  question  as  to 
what  constitutes  an  acceptable  increase  in  elapsed  time.  On  the  pie 
slice scheduler, if the user has a 10% slice, then a multiplication of
elapsed  time  by  10  (due  to  waits  for  faults  or  due  to  scheduling)  is  not 
unreasonable.  This  corresponds  to  somewhere  between  2000  and  6500 
references  per  page  fault,  depending  on  estimates  of  elapsed  time  to 
resolve  a  fault. 
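The elapsed-time estimates can be reproduced in the same way. This is a sketch under the assumptions stated in the text (1.5 microseconds per reference, 30 to 100 milliseconds of elapsed time per fault) plus one simplification of our own: the job does nothing else while a fault is being resolved. All function names are hypothetical.

```python
CPU_PER_REFERENCE_S = 1.5e-6  # average time per memory reference (from the text)


def references_equivalent(fault_delay_s: float) -> float:
    """Number of ordinary memory references that fit into one fault delay."""
    return fault_delay_s / CPU_PER_REFERENCE_S


def elapsed_time_factor(references_per_fault: float, fault_delay_s: float) -> float:
    """Elapsed time with faults divided by elapsed time with none,
    assuming the job simply waits out each fault."""
    compute = references_per_fault * CPU_PER_REFERENCE_S
    return (compute + fault_delay_s) / compute


def references_per_fault_for_slowdown(factor: float, fault_delay_s: float) -> float:
    """References per fault at which elapsed time is multiplied by `factor`."""
    return fault_delay_s / ((factor - 1) * CPU_PER_REFERENCE_S)


for delay in (0.030, 0.100):
    print(delay,
          round(references_equivalent(delay)),
          round(references_per_fault_for_slowdown(10, delay)))
```

The round figures in the text (2000 and 6500 references per fault for a tenfold slowdown) are one tenth of the per-fault equivalents; the exact formula in the sketch gives somewhat larger values because the compute time itself accounts for one of the ten parts.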
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Composition  of  a  Working  Set 

Given  that  one  intends  to  reduce  the  working  set  of  INTERLISP  in 
order  to  reduce  page  faulting,  the  question  arises  as  to  what  the 
working  set  of  a  typical  program  is  made  up  of.  Since  the  concept  of 
LISP  is  associated  with  the  notion  of  list-structure  and  the  existence 
of  large  data  bases  of  list  structure,  one  might  expect  that  much  of  the 
working  space  is  tied  up  in  list  structure.  Given  this,  one  might  try 
to  reduce  the  working  set  by  such  techniques  as  linearization  and 
compactification of list structure. In fact, for the programs measured,
lists  take  only  a  relatively  small  amount  of  the  working  space  relative 
to  other  items. 


Taking  the  page  reference  data,  we  simulated  an  LRU  algorithm  for 
four sizes of working set - 75 pages (a rather cramped set), 100 pages
(still  small),  125  (reasonable),  and  150  pages  (a  fairly  generous  one). 
At  intervals  we  determined  which  pages  were  in  the  working  set  and  what 
their  data  type  was.  We  distinguished  among  several  different  types  of 
data  - 


MACRO - hand-coded part of the system

COMPILED CODE - array space with instruction fetch references

ARRAYS - array space with no instruction fetches

STACKS - control and variable binding stacks

LISTS - CONS cell area

ATOMHT - hash table for atoms

ATOMS - atom header area

PNAME - print names of atoms

STRING - characters in strings

STRING POINTERS - pointers to bounds of individual strings

FIXED NUMBERS - fixed point numbers

OTHER - stack pointers, etc.

The results for single programs seemed fairly stable in time, and
reasonably consistent from one program to another. We have plotted the
composition of the working set for several programs, and include
complete tabular data here. The data given average the composition of
the working set over the course of each program's execution; the
time-varying data are available, but do not seem to be of any greater
interest than the averaged data.
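The composition measurement described above can be sketched as follows. This is an illustration of the technique (a fixed-size LRU set sampled at intervals and broken down by page type), not the analysis program actually used; the function name, its arguments and the toy page-type table are assumptions.

```python
from collections import Counter, OrderedDict


def lru_composition(page_refs, page_type, size, sample_every):
    """Simulate an LRU working set of `size` pages over `page_refs` and
    return the average fraction of resident pages of each data type,
    sampled once every `sample_every` references."""
    resident = OrderedDict()  # page -> None, least recently used first
    totals = Counter()
    samples = 0
    for i, page in enumerate(page_refs, start=1):
        if page in resident:
            resident.move_to_end(page)        # now the most recently used
        else:
            if len(resident) >= size:
                resident.popitem(last=False)  # evict the least recently used
            resident[page] = None
        if i % sample_every == 0:
            counts = Counter(page_type[p] for p in resident)
            for kind, n in counts.items():
                totals[kind] += n / len(resident)
            samples += 1
    return {kind: t / samples for kind, t in totals.items()} if samples else {}


# Toy trace: pages 0 and 1 hold MACRO code, page 2 lists, page 3 atoms.
types = {0: "MACRO", 1: "MACRO", 2: "LISTS", 3: "ATOMS"}
result = lru_composition([0, 1, 2, 0, 1, 3, 0, 1], types, size=3, sample_every=4)
print(result)
```

Averaging the sampled fractions over the run corresponds to the time-averaged composition reported here; keeping the individual samples instead would give the time-varying data.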
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COMPILEMEASURE 
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numbers  signify  data  types : 


1 = MACRO code
2 = COMPILED code and ARRAY
3 = LISTS
4 = ATOMS
5 = ATOMHT
6 = PNAMES
7 = STRINGS
8 = STRPTRS
9 = FIXNUMS
S = STACK

[Plots of working set composition for COMPILEMEASURE, CONGENSIM and
PARSEMEASURE; the same legend applies to all three.]

On the average, over half of the working set is taken up with
program code. The MACRO code seems to be referenced quite often, as
indicated by the fact that all the MACRO code needed by a program seems
to be in the working set for 100 pages, and no extra MACRO code comes in
at 150 pages. Thus, as you go from 100 to 150 pages the "execution
code" that is added is almost entirely compiled LISP. Note also that
atoms and their ancillary storage are heavily referenced - adding
together ATOMHT, ATOMS and PNAMES one gets over 20% of the working set.
The remaining 25% is divided up among the other items, with list
structure taking only 10-15% of the space.

These data suggest that the three best places to look to reduce
working set size are MACRO code, compiled code and atoms. Other data
reported below indicate that while 20 pages of the MACRO code are
referenced, fewer than 5000 words (10 pages) of the MACRO code are
actually used in running the given examples (e.g. the MACRO code for
error recovery, backtracing, etc. is not being used, but it is
intertwined with the other code). Thus, by reorganizing the MACRO code
about 10 pages can be saved. It is possible that good reorganization
can do even better by taking into account the statistical patterns of
references within the MACRO code to group together instructions commonly
used together. Because the compiled code is the largest single data
type, it is reasonable to spend time looking to improve the compiler to
produce more compact code. A 10% reduction in size of the compiled code
could reduce the working set by 3 to 5 pages. Finally, the large amount
of space used by atoms and their ancillary data suggests that
compactification of atoms might be useful. In the current system each
atom requires four words of storage plus its PNAME - in order to allow
it to have a top level value, a property list and a function definition.
The hash table entry is one word, and the atom header takes three words,
since it must hold a full-word function cell, the PNAME pointer, the
property list pointer and the value pointer. Other data we have
collected suggest that this is quite wasteful, that few atoms have all
three features, and that many atoms are used entirely as "indicators"
and have only their PNAME and no property list, value or function
definition. It is conceivable that this might be taken into account in
designing a new structure for atoms.
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Counts  of  references  to  various  page  types 


One way of determining the general pattern of activity of INTERLISP
is to find the actual number of references to a certain type of page
during the run of a program. We collected this data and obtained two
surprising results. First, even for large compiled programs, over 80% of
the instructions executed were actually ones in the hand-coded part of
the LISP kernel. Second, although LISP is associated with the concept of
list-processing, fewer than 1.7% of all memory references (instruction
fetch and data read or write) go to list structure space. We give the
figures for several example programs below. The numbers refer to the
fraction of the total number of memory references made by the given
program to the particular type of page. "R/W" signifies read/write
references to the page; "Instruction fetch" indicates references to
memory to obtain instructions. The page types are indicated as follows:


MACRO: instruction portion of hand coded assembly language kernel

CCDAR: compiled code and/or arrays

ASC&V: constants and temporary storage associated with MACRO

PSTAK: the variable binding PDL

CSTAK: the control PDL

LISTS: CONS cell pages

ATOMS: atom header pages

PNAME: pages containing the print name character strings for atoms

NUMS: fixed and floating point numbers

PAGE0: the accumulators (registers) and the UUO trap location

                                  PARSEMEASURE  CONGENSIM  COMPILEMEASURE  EDIT/CLEANUP  SUBNET

Total instruction fetch           .563          .565       .570            .569          .482
Total R/W                         .437          .435       .430            .431          .518

MACRO: instruction fetch          .514          .452       .438            .444          .362
CCDAR: instruction fetch          .040          .095       .121            .113          .108
MACRO: R/W                        .017          .032       .025            .030          .108
CCDAR: R/W                        .009          .024       .022            .023          .041
ASC&V: R/W                        .058          .096       .074            .094          .068
PSTAK: R/W                        .110          .067       .097            .066          .076
CSTAK: R/W                        .090          .117       .095            .106          .082
LISTS: R/W                        .012          .013       .015            .010          .016
ATOMS: R/W                        .022          .003       .003            .004          .003
ATOMHT: R/W                       .000          .000       .000            .001          .000
PNAME: R/W                        .001          .000       .001            .004          .001
NUMS: R/W                         .000          .000       .001            .000          .000
PAGE0: instruction fetch          .009          .018       .011            .013          .012
PAGE0: R/W (registers, UUO word)  .118          .083       .097            .092          .123
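As a cross-check on the table, the following sketch recomputes two quantities from the tabulated fractions: the share of each program's instruction fetches that fall in MACRO code, and whether the three instruction fetch rows add up to the "Total instruction fetch" row. The figures are transcribed from the table; the helper names are illustrative.

```python
# Fractions of all memory references, transcribed from the table above.
PROGRAMS = ["PARSEMEASURE", "CONGENSIM", "COMPILEMEASURE", "EDIT/CLEANUP", "SUBNET"]
TOTAL_FETCH = [0.563, 0.565, 0.570, 0.569, 0.482]
MACRO_FETCH = [0.514, 0.452, 0.438, 0.444, 0.362]
CCDAR_FETCH = [0.040, 0.095, 0.121, 0.113, 0.108]
PAGE0_FETCH = [0.009, 0.018, 0.011, 0.013, 0.012]
LISTS_RW = [0.012, 0.013, 0.015, 0.010, 0.016]


def macro_share_of_fetches():
    """Fraction of each program's instruction fetches that hit MACRO code."""
    return {p: m / t for p, m, t in zip(PROGRAMS, MACRO_FETCH, TOTAL_FETCH)}


def fetch_rows_consistent(tolerance=0.0015):
    """True if MACRO + CCDAR + PAGE0 fetches match the total within rounding."""
    return all(
        abs(m + c + p - t) <= tolerance
        for m, c, p, t in zip(MACRO_FETCH, CCDAR_FETCH, PAGE0_FETCH, TOTAL_FETCH)
    )


for program, share in macro_share_of_fetches().items():
    print(f"{program:15s} {share:.3f}")
print("fetch rows consistent:", fetch_rows_consistent())
print("largest LISTS fraction:", max(LISTS_RW))
```

With these figures the MACRO share of instruction fetches ranges from roughly 75% (SUBNET) to roughly 91% (PARSEMEASURE), and no program sends more than 1.6% of its references to list space.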
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Detailed  instruction  fetch  measurements  on  MACRO  code  -  bottlenecks 

The  second  set  of  measurements  was  made  to  determine  exactly  where 
the  CPU  time  used  in  performing  typical  INTERLISP  tasks  is  spent.  Given 
that  the  vast  majority  of  the  INTERLISP  system  consists  of  compiled  LISP 
code  rather  than  hand-coded  assembly  language  (about  200k  words  of 
compiled  LISP  code  and  about  15k  words  of  hand-written  MACRO  code)  one 
might  expect  that  a  substantial  portion  of  the  computation  done  by  LISP 
consists  of  executing  compiled  code.  This  is  reinforced  by  the  fact 
that  over  half  of  the  memory  required  in  the  working  set  for  a  given 
program  is  in  the  compiled  code.  However,  as  revealed  by  the 
instruction fetch data above, approximately 80% of the instructions
being executed were part of the hand-coded kernel of the INTERLISP
system, the MACRO code. Thus, we decided to take a more detailed look at
the  distribution  of  instruction  fetches  in  the  hand-coded  kernel. 

The  simulator  was  modified  to  record  in  detail  the  pattern  of 
instruction  fetches  that  occurred  within  the  macro  code.  All  memory 
references  outside  the  range  occupied  by  the  hand-code  and  its  temporary 
data  storage  were  lumped  together.  Within  the  hand-code  area  fetch  and 
read/write counts were kept for contiguous 8-word chunks of memory.

While it would have been somewhat more meaningful to record data in
terms of functional components of the hand-code (e.g. particular
subroutines), the table that would have been required was too large, and
the time overhead prohibitive. The use of 8-word chunks allowed us to
localize  references  sufficiently  to  determine  the  functional  chunks  by 
after-measurement  analysis. 
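The bookkeeping described above amounts to a histogram keyed by chunk number. A minimal sketch follows, assuming that chunks are aligned on absolute 8-word boundaries and that the hand-code occupies a single address range; both are assumptions of the sketch, as are all names.

```python
from collections import Counter

CHUNK_WORDS = 8  # chunk size used in the measurements


def chunk_counts(fetch_addresses, macro_start, macro_end):
    """Count instruction fetches per 8-word chunk for addresses inside
    [macro_start, macro_end). Fetches outside that range are lumped
    together under the key None."""
    counts = Counter()
    for addr in fetch_addresses:
        if macro_start <= addr < macro_end:
            counts[addr // CHUNK_WORDS] += 1
        else:
            counts[None] += 1
    return dict(counts)


# Octal addresses, as on a PDP-10; the range 1000-2000 (octal) is made up.
example = chunk_counts([0o1000, 0o1001, 0o1007, 0o1010, 0o5000], 0o1000, 0o2000)
print(example)
```

A table indexed by chunk number is one eighth the size of a per-word table, which is the space saving the text refers to; mapping chunks back to routines is then done afterwards from an assembly listing.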
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The  resulting  data  produced  a  rather  strong,  and  to  some  people  a 
surprising  result.  If  the  8  word  chunks  were  ordered  (for  each  program) 
by  the  number  of  fetches  made  within  that  chunk,  then  for  all  programs 
measured  the  top  30  chunks  accounted  for  over  48%  of  the  total 
instruction  fetches  made  by  the  program.  In  fact,  the  average  over  12 
quite different types of programs was that over 60% of the instruction
fetches  for  a  program  were  contained  within  the  program's  top  30  chunks. 

It  was  not  only  the  case  that  each  program  had  its  own  "top  30" 
chunks  -  the  union  of  the  sets  of  "top  30"  chunks  had  only  54  distinct 
chunks! Moreover, 45 chunks covered over 50% of the references made by
all of the programs. Thus, fewer than 350 words of hand-code (possibly
fewer  than  300  words  since  many  of  the  chunks  contained  obviously 
low-probability  code)  accounted  for  the  lion's  share  of  the  execution 
time  taken  by  INTERLISP. 
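The "top 30" analysis can be expressed compactly. The sketch below uses made-up counts and a top-2 cut so that it can be followed by hand; the function names are hypothetical, and `total_fetches` is meant to include fetches outside the hand-code, as in the percentages quoted above.

```python
def top_chunks(chunk_fetches, k=30):
    """Return the set of the k chunks with the most instruction fetches."""
    ranked = sorted(chunk_fetches, key=chunk_fetches.get, reverse=True)
    return set(ranked[:k])


def coverage(chunk_fetches, chunks, total_fetches):
    """Fraction of a program's instruction fetches that fall in `chunks`."""
    return sum(chunk_fetches.get(c, 0) for c in chunks) / total_fetches


# Toy data: chunk number -> fetch count, out of 100 fetches per program.
prog_a = {1: 50, 2: 30, 3: 5, 4: 5}
prog_b = {2: 40, 5: 35, 1: 5}

top_a = top_chunks(prog_a, k=2)
top_b = top_chunks(prog_b, k=2)
union = top_a | top_b

print(coverage(prog_a, top_a, 100), coverage(prog_b, top_b, 100))
print(sorted(union))
```

A small union of the per-program sets, as found in the measurements, means that the same few chunks dominate every program, so tuning them benefits all workloads at once.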

On  the  basis  of  this  data  we  were  able  to  pinpoint  a  small  number 
of  high-priority  portions  of  the  hand-code  to  optimize.  As  it  turned 
out,  there  was  extremely  high  agreement  between  the  data  and  the 
"educated  guesses"  of  the  knowledgeable  members  of  the  INTERLISP 
community  -  the  worst  offenders  had  been  predicted  ahead  of  time  by  many 
of  the  people  familiar  with  the  implementation,  and  there  were  almost  no 
qualitative  surprises  -  only  the  sheer  concentration  of  the  instruction 
fetches  was  surprising. 

While the exact core location and time spent are useful to the
systems programmers in determining what words of the MACRO code should
be carefully tightened, this level of detail seems unnecessary for this
report. Thus, we will give primarily the highlights of the results.
The gory details will be made available to those who request them.

BBN Report No. 3331
Bolt Beranek and Newman Inc.

The  single  largest  bottleneck  in  the  system  turned  out  to  be  the 
procedure  for  looking  up  variable  bindings  on  the  stack.  This  took  up 
between  10%  and  45%  of  the  total  instructions  executed,  with  an 
"average"  (weighted  equally  over  all  measured  programs)  of  over  20%. 
Programs  which  were  block  compiled  tended  to  have  the  lower  values  of 
time  spent  in  variable  lookup,  but  still  substantial  amounts.  The  next 
greatest amount of time, averaging 9% of the instruction fetches, lay
in  the  function  calling  sequence,  followed  by  about  8%  of  instruction 
fetches  in  the  type  checking  routines.  If  the  time  spent  in  the  UUO 
word  and  UUO  dispatcher  are  added  to  these  times,  the  total  time  spent 
in  the  function  call  and  type  checking  bottleneck  is  almost  20%  of  the 
instruction  fetches.  The  next  big  bottleneck  is  the  binding  of 
variables  on  entry  to  a  function,  and  this  takes  about  5.6%  of  the 
instruction fetches. Finally, to no one's surprise, the CONS routine
takes  about  5%  of  the  instruction  fetches.  This  is  certainly  high  for 
fewer  than  thirty  words  of  code,  but  it  is  not  as  bad  as  many  people 
thought,  given  the  complexity  of  the  INTERLISP  CONS  algorithm. 
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Distribution  of  instruction  fetch  references  for  several  programs 
(Data  from  top  30  chunks,  functionally  distributed) 


Function 
Call  | 

PARSEMEASURE 

1 

1 

Entry 

PDL  search 

!  type  checking  i 
1  1 
1  1 

CONS 

1 

i 

.073  ! 

COMPILEMEASURE 

.039  ! 

1 

1 

.455  (!) 

!  .084  i 

1  1 

1  1 

.024 

I 

.050  ! 

DWIMIFY 

.130  ! 

.232 

;  .126  i 

.074 

.057  ! 

•  *Q ) 

WEST 

.092  ! 

.  107 

1  .133  ! 

.008 

!  .071 (IUB 

.146 

**Q ) 

COMNASAGRAMMAR 

.117  ! 

.  107 

!  .104  ! 

.049 

J . 01 7 ( IUB 

.140  ! 

TASKINLS 

.128  ! 

.214 

!  .050  ! 

.055 

.100  ! 

.050  1 

.  157 

!  .073  ! 

.061 

av.  .094  .093  .212  .082  .045

Brief Program Descriptions:

PARSEMEASURE

June 1975 version of L. Bates' parser for the BBN speech understanding
system, parsing a short sentence. Program not highly tuned.

COMPILEMEASURE

Compilation of 9 short and medium size functions from in-core
definitions - compilation results stored in core and on a file. Program
coded by systems personnel and carefully tuned.

DWIMIFY

Application of error correction function DWIMIFY to medium-size function
containing CLISP expressions. Program carefully tuned and coded by
systems personnel.

WEST

Early version of a CAI program to teach arithmetic. Coded by
non-systems personnel using a highly-modular, functionally decomposed
style.

COMNASAGRAMMAR

Compiled version of ATN parser from the LUNAR natural language system.
Code produced by grammar-compiler.

TASKINLS

LISP simulation of NLS system under control of a CAI lesson monitor and
evaluator. Block-compiled system, moderately tuned.

Detailed description of the operation of the simulator and analyzer

We  present  below  some  fine  details  regarding  the  simulator  and  the 
way  that  page  faulting  data  was  analyzed.  We  hope  that  this  information 
might  be  useful  to  anybody  wishing  to  further  analyze  or  interpret  the 
data  given  in  this  report. 

Since the simulator increases the CPU time needed to perform an
operation by a factor of 40 to 80, it is tempting to extract as
much  data  as  possible  during  a  run  of  the  simulator.  This  data  can  then 
be  processed  by  any  number  of  analysis  programs  to  provide  various 
characterizations  of  the  operation  of  the  program  in  executing  the  given 
job.  However,  there  is  a  time/space  tradeoff  that  arises  that  limits 
the  amount  of  raw  data  that  can  be  collected.  Conceivably,  one  could 
write  out  on  a  file  the  entire  sequence  of  instructions  executed  and  the 
memory  references  made  during  the  execution  of  a  given  user  program. 
While  this  would  give  a  complete  record  of  the  computational  activity  of 
the  program,  it  is  unfeasible  for  any  but  very  short  jobs  -  on  a  machine 
which normally executes 300,000 to 800,000 instructions per second, a
few  seconds  of  CPU  time  of  the  user  job  would  produce  enough  data  to 
fill an entire magnetic tape! Additionally, the I/O time needed to write
out  the  volume  of  page  reference  data  would  be  prohibitive. 

Thus,  the  alternative  tack  was  taken  -  certain  measures  of  the 
memory  referencing  activity  were  abstracted  during  the  simulation  and 
then  written  out  to  be  later  analyzed.  In  all  cases,  parameters  were 
accumulated for a quantum of 2048 memory references, and then the
abstracted  data  were  written  out.  Two  distinct  measures  were  made.  The 
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first  measure  was  made  in  order  to  determine  the  page  referencing 
activity  of  INTERLISP  -  this  is  the  raw  data  used  to  determine 
properties  of  the  INTERLISP  "working  set".  To  obtain  this  measure,  the 
page  number  was  obtained  for  every  reference  to  memory  (including  those 
occurring  during  indirect  reference  chains).  Two  tables  were  kept,  one 
containing  the  number  of  "instruction  fetch"  references  to  each  page, 
and  the  other  containing  the  number  of  read/write  references.  The 
reference counts were accumulated during a quantum (2048 total
references) and then a record was written out indicating all pages which
had been referenced during the quantum, and the number of read/write and
fetch references actually made. In addition, the INTERLISP type table
was saved for the given job, giving a record of the "type" of the page
(i.e. whether it contained MACRO code, stack, lists, atom headers,
compiled code and arrays, etc.). All measurements were made under
conditions in which no garbage collections (which can cause page
shuffling) would occur, so that the single type table was sufficient to
record the characteristics of each page.
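The per-quantum record keeping can be sketched as below. The input format (a stream of address and fetch-flag pairs) and the function name are assumptions made for the illustration; the quantum of 2048 references and the 512-word page are taken from the text.

```python
from collections import Counter

QUANTUM = 2048    # memory references per record, as in the text
PAGE_WORDS = 512  # words per page


def quantum_records(references, quantum=QUANTUM):
    """Yield one (fetch_counts, rw_counts) pair of per-page tables for each
    quantum. `references` is an iterable of (address, is_fetch) pairs."""
    fetch, rw, seen = Counter(), Counter(), 0
    for address, is_fetch in references:
        page = address // PAGE_WORDS
        (fetch if is_fetch else rw)[page] += 1
        seen += 1
        if seen == quantum:
            yield dict(fetch), dict(rw)
            fetch, rw, seen = Counter(), Counter(), 0
    if seen:  # final, partially filled quantum
        yield dict(fetch), dict(rw)


# Tiny trace with a quantum of 2 so the records can be checked by hand.
trace = [(0, True), (1, False), (512, True), (513, True), (1024, False)]
records = list(quantum_records(trace, quantum=2))
print(records)
```

Only pages with a nonzero count appear in a record, which matches the description of writing out "all pages which had been referenced during the quantum" together with their counts.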

An added degree of subtlety had to be taken into account in
recording page references, because of the "code swapping" or "compiled
code overlay" facility of INTERLISP. INTERLISP maintains one (and
potentially several) "lower forks" in which it stores compiled code. A
segment  of  the  basic  512k  address  space  (generally  64  pages  of  512  words 
each)  is  reserved  as  a  "swapping  buffer".  By  use  of  PMAP's  this  buffer 
is  used  to  window  sections  of  the  lower  fork(s)  to  run  code,  and 
therefore  a  reference  to  a  "real"  page  in  the  swapping  buffer  is  in 
actuality  a  reference  to  some  "virtual"  page  in  the  lower  fork.  Thus, 
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the  potential  address  space  of  an  INTERLISP  program  is  not  limited  to 
the  512  pages  directly  addressable  under  TENEX  -  it  can  be  indefinitely 
large, though in practice it is currently limited to 960 pages (1024 for
two  forks,  minus  64  pages  in  the  swapping  buffer).  It  was  decided  to 
record  the  "virtual  page"  touched  by  each  memory  reference,  so  that  we 
could  tell  which  compiled  code  was  being  used,  rather  than  simply  what 
pages  in  the  swapping  buffer  were  being  used  to  window  compiled  code. 
An  added  complication  is  that  the  assignment  of  pages  in  the  lower  fork 
to  pages  in  the  swapping  buffer  is  dynamically  variable,  and  so  the 
simulator  must  make  use  of  the  INTERLISP  swapper's  tables  to  convert 
each  reference  to  the  swapping  buffer  to  the  current  page  reference  in 
the  lower  fork. 
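The translation from a swapping buffer page to a lower-fork page might look like the following sketch. The layout of the swapper's tables is not given in this report, so `swap_table` (buffer slot to lower-fork page) is a hypothetical stand-in, and numbering lower-fork pages from 512 upward is simply a convention chosen here to keep the two forks distinct.

```python
MAIN_FORK_PAGES = 512  # pages directly addressable under TENEX
BUFFER_PAGES = 64      # size of the swapping buffer in pages


def virtual_page(real_page, buffer_start, swap_table):
    """Map a page touched in the main fork to the page actually referenced.

    swap_table[slot] is the lower-fork page currently windowed into buffer
    slot `slot`; it is a hypothetical stand-in for the swapper's tables and
    must be consulted at the time of each reference, since the assignment
    changes dynamically."""
    slot = real_page - buffer_start
    if 0 <= slot < BUFFER_PAGES:
        return MAIN_FORK_PAGES + swap_table[slot]
    return real_page  # ordinary page outside the swapping buffer


swap_table = {0: 17, 1: 300}  # made-up current window assignments
print(virtual_page(100, 400, swap_table))  # outside the buffer
print(virtual_page(400, 400, swap_table))  # buffer slot 0
print(virtual_page(401, 400, swap_table))  # buffer slot 1
```

Under this numbering the distinct virtual pages are the 448 main-fork pages outside the buffer plus 512 lower-fork pages, i.e. the 960 pages mentioned above.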

In  the  section  on  Page  Faulting  vs.  Working  Set  Size  we  indicated 
our  use  of  a  simplified  page  management  algorithm  (LRU)  to  replace  the 
page  management  procedures  actually  used  by  TENEX.  To  make  it  possible 
to obtain page faulting behavior for different working set sizes with
just  a  single  pass  over  the  data  from  the  simulator,  we  make  use  of  a 
related  concept,  the  "distance  string",  rather  than  directly  simulating 
the  LRU  algorithm. 

Given a sequence of page references, the corresponding distance
string  is  a  sequence  of  numbers  which  gives,  for  each  reference,  the 
number  of  distinct  pages  which  have  been  referenced  since  the  last  time 
the  given  page  was  referenced.  Thus,  given  an  LRU  algorithm,  for  a 
fixed  working  set  size  all  page  references  which  have  a  distance  string 
value  greater  than  the  working  set  size  will  cause  faults,  and  all 
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references  with  lower  distances  will  not  fault.  This  permits  one  to 
make  a  single  run  through  the  distance  string  file  and  compute  the 
number  of  faults  for  any  number  of  different  working  set  sizes. 
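A sketch of the exact (unquantized) distance string method follows. To make "greater than the working set size" the fault condition, the distance is taken here as the 1-based position of the page in the LRU stack, which counts the page itself; that convention, the treatment of first references as unconditional faults, and the function names are assumptions of the sketch.

```python
from collections import Counter


def distance_string(page_refs):
    """LRU stack distance of each reference: the 1-based position of the
    page in the stack of pages ordered by recency, or None on first use."""
    stack = []  # most recently used page first
    distances = []
    for page in page_refs:
        if page in stack:
            distances.append(stack.index(page) + 1)
            stack.remove(page)
        else:
            distances.append(None)
        stack.insert(0, page)
    return distances


def faults_by_size(distances, sizes):
    """Fault counts for every working set size from one pass over the
    distance string; first references always fault."""
    histogram = Counter(distances)
    cold = histogram.pop(None, 0)
    return {w: cold + sum(n for d, n in histogram.items() if d > w) for w in sizes}


refs = [1, 2, 3, 1, 2, 3, 4, 1]
dist = distance_string(refs)
print(dist)
print(faults_by_size(dist, [2, 3, 4]))
```

Because the histogram of distances is independent of the working set size, the whole page-fault versus working set curve is obtained from a single pass over the trace.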

Most of our data comes in quantized sets of 2048 memory references,
and thus we only know the time of reference of a page to within 2048
memory  cycles.  Because  of  this  we  must  use  an  approximation  to  the 
distance  string  algorithm.  The  resulting  analysis  of  our  data  is  not 
exactly  equivalent  to  the  results  of  the  simple  LRU  algorithm  described 
above.  For  each  page,  we  compute  the  number  of  distinct  pages  which 
have  been  referenced  since  the  last  quantum  in  which  the  given  page  was 
referenced.  We  include  in  that  count  all  pages  referenced  in  the 
quantum  when  the  given  page  was  previously  referenced  which  have  not 
been  referenced  in  the  intervening  quanta. 
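One possible reading of this approximation is sketched below. Each quantum is reduced to the set of pages it touched; the distance of a page is the number of distinct pages seen from its previous reference quantum (inclusive) up to, but not including, the current quantum. Whether pages of the current quantum should also be counted is not stated above, so their exclusion is an assumption, as are the names used.

```python
def quantized_distances(quanta):
    """`quanta` is a list of sets, one per quantum, of the pages referenced.
    Returns (quantum_index, page, distance) triples; distance is None the
    first time a page is seen."""
    last_seen = {}  # page -> index of the last quantum that referenced it
    out = []
    for q, pages in enumerate(quanta):
        for page in sorted(pages):
            q0 = last_seen.get(page)
            if q0 is None:
                out.append((q, page, None))
            else:
                # Every page of the previous reference quantum is counted,
                # together with the pages of the intervening quanta.
                touched = set().union(*quanta[q0:q])
                out.append((q, page, len(touched)))
        for page in pages:
            last_seen[page] = q
    return out


example = quantized_distances([{1, 2, 3}, {2, 4}, {1, 5}])
print(example)
```

The sketch recomputes each union from scratch for clarity; a production analysis over thousands of quanta would maintain the recency ordering incrementally instead.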
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We  have  compared  the  2048  memory  reference  quantum  data  with  data 
taken  with  a  quantum  of  128  memory  references  (in  which  the  average 
number  of  page  references  is  slightly  less  than  10  per  quantum).  The 
graphs  of  a  few  of  these  runs  are  given  below  for  comparison.  The 
calculation  of  number  of  page  faults  for  a  given  working  set  size  is 
substantially the same (within a 2% range) for both quantum sizes, until
the  working  set  drops  below  56  pages.  This  is  an  indication  that  the 
distance  string  values  greater  than  56  pages  are  quite  accurate  for  the 
large  quantum  data,  and  since  we  are  not  extremely  interested  in  the 
behavior  of  INTERLISP  below  about  75  pages  (at  which  point  it  is  already 
page-faulting  almost  every  millisecond  -  a  ridiculously  high  rate),  the 
large  quantum  data  is  sufficient  to  characterize  the  paging  performance 
of  INTERLISP. 

Some  of  the  reasons  why  the  large  quantum  approximation  is  likely 
to  be  fairly  accurate  for  distance  string  values  above  50  are: 

a)  On  the  average  there  are  about  25  pages  referenced  in  each 
quantum,  and  data  indicates  that  10  to  15  of  those  are  referenced 
in  almost  every  quantum.  Thus,  for  distance  string  values  greater 
than  50  -  two  quanta  of  references  at  least  -  the  number  of  pages 
in  the  "previous  reference  quantum"  which  are  not  referenced  in  the 
intervening  quanta  is  almost  certain  to  be  less  than  10. 

b) The data indicate that over 90% of the distance string values
are below 60, so that for a page with distance string value over
60, chances are that the contribution from its "previous reference
quantum" is less than 10% of the number of page references
originally in that quantum, since the other 90% of those pages also
occur in at least one of the intermediate quanta. Thus, the
variation due to counting all of the remaining pages in the
previous reference quantum is on the order of 10% of the number of
pages in the quantum.

Data from run of the dwimification, etc. example using a 128 memory
reference quantum

Example: DWCL128

1204224 Memory references in example

Allowed        Page
Working Set    Faults

   248           240
   240           240
   232           241
   224           245
   216           248
   208           253
   200           260
   192           284
   184           314
   176           322
   168           334
   160           354
   152           375
   144           402
   136           420
   128           471
   120           569
   112           619
   104           712
    96           829
    88           980
    80          1241
    72          1541
    64          1916
    56          2486
    48          3431
    40          5269

[Plot: page faults against allowed working set]

Data from run of compilation using 128 memory reference quantum

Example: SMALL128COM

2037760 Memory references in example

Allowed        Page
Working Set    Faults

   240           236
   232           238
   224           241
   216           244
   208           251
   200           279
   192           290
   184           293
   176           302
   168           309
   160           322
   152           344
   144           368
   136           422
   128           521
   120           786
   112           898
   104           963
    96          1035
    88          1132
    80          1256
    72          1425
    64          1675
    56          2137
    48          2707
    40          4139

[Plot: page faults per 20000 memory references against allowed working set]
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Other  page-fault  versus  working  set  curves 


Example: COMPILEMEASURE

2928640 Memory references in example

Allowed        Page
Working Set    Faults

   240           233
   232           234
   224           236
   216           240
   208           248
   200           285
   192           296
   184           300
   176           310
   168           321
   160           335
   152           355
   144           388
   136           474
   128           659
   120           945
   112          1067
   104          1156
    96          1232
    88          1342
    80          1514
    72          1731
    64          2045
    56          2544
    48          3706
    40          6847

[Plot: page faults against allowed working set]

Example: EDIT/CLEANUP

8392704 Memory references in example

Allowed        Page
Working Set    Faults

   320           317
   312           317
   304           319
   296           339
   288           346
   280           354
   272           366
   264           377
   256           396
   248           415
   240           460
   232           485
   224           523
   216           561
   208           602
   200           657
   192           723
   184           796
   176           875
   168           956
   160          1067
   152          1225
   144          1391
   136          1547
   128          1807
   120          2112
   112          2440
   104          2914
    96          3562
    88          4348
    80          5530
    72          7386
    64         10041
    56         14020
    48         22092
    40         52392

[Plot: page faults per 20000 memory references against allowed working set]

Example: NLSPARSE

473088 Memory references in example

Allowed        Page
Working Set    Faults

     ?           205
   208           211
   200           211
   192           211
   184           215
   176           216
   168           219
   160           221
   152           226
   144           231
   136           233
   128           258
   120           273
   112           291
   104           305
    96           222
    88           342
    80           378
    72           458
    64           603
    56           785
    48          1200
    40          2833

[Plot: page faults per 20000 memory references against allowed working set]

Example: PARSEMEASURE

4333568 Memory references in example

Allowed        Page
Working Set    Faults

   288           287
   280           289
   272           289
   264           289
   256           289
   248           289
   240           291
   232           295
   224           301
   216           312
   208           317
   200           327
   192           342
   184           365
   176           406
   168           453
   160           406
   152           545
   144           583
   136           625
   128           837
   120          1227
   112          1390
   104          1549
    96          1892
    88          2173
    80          2577
    72          3329
    64          5236
    56          7732
    48         12129
    40         29041

[Plot: page faults per 20000 memory references against allowed working set]